Identifying short motifs by means of extreme value analysis 
Abstract. - The problem of detecting a binding site - a substring of DNA where transcription 
factors attach - on a long DNA sequence requires the recognition of a small pattern in a large 
background. For short binding sites, the matching probability can display large fluctuations from 
one putative binding site to another. Here we use a self-consistent statistical procedure that 
accounts correctly for the large deviations of the matching probability to predict the location of 
short binding sites. We apply it in two distinct situations: (a) the detection of the binding sites for 
three specific transcription factors on a set of 134 estrogen-regulated genes; (b) the identification, 
in a set of 138 possible transcription factors, of the ones binding a specific set of nine genes. In 
both instances, experimental findings are reproduced (when available) and the number of false 
positives is significantly reduced with respect to the other methods commonly employed. 



Introduction. — Understanding the regulation of 
gene expression, i.e. the cellular process that controls the 
amount and timing of appearance of the functional prod- 
uct of a gene, is a challenging task. The expression of a 
gene is controlled by proteins called transcription factors, 
which bind to short segments of DNA called binding sites 
(BSs). BSs are located on long strings of DNA of about 
2000 nucleotides (the promoters), upstream of genes. The 
problem of identifying BSs clearly plays a central role for 
elucidating the mechanics of gene regulation. The detec- 
tion of BSs can be carried out experimentally by several 
high-throughput techniques (see e.g. [1]), though still at 
very high cost. From such measurements it is possible to 
infer the frequency with which every nucleotide (A, C, G 
or T) appears in the BS of a given transcription factor. 
These data ultimately represent the binding specificity 
of transcription factors and are usually encoded in the 
so-called Position-Specific Frequency Matrices (PSFMs) 
which are catalogued in e.g. the JASPAR [2] and TransFac 
[3] databases. The entry $f_{ij}$ of a PSFM gives the 
frequency with which nucleotide $j \in \{A, C, G, T\}$ appears in 
position $i \in \{1, \ldots, \ell\}$ on the BS ($\ell$ denoting its length) 
of a given transcription factor, with $\sum_j f_{ij} = 1$. Thousands 
of such matrices are available today, covering BSs 
for many different transcription factors. Developing effective 
computational methods to predict the position of 
BSs on the promoter from the known PSFMs would produce 
a crucial advantage in terms of identifying new BSs 
and improving the characterization of binding specificity. 
From a theorist's perspective, the question in somewhat 
simplified terms is the following: given a long string of 
DNA (the promoter), which short substring is the best 
putative BS according to the experimental PSFM? 

In order to tackle this issue, several methods have been 
developed and are presently used [4-6]. Most of them 
assume a Markovian model as the underlying string gen- 
erator and consist in (i) a maximum-likelihood procedure 
to identify candidate substrings on the promoter, and (ii) 
a statistical test to evaluate the significance of the re- 
sults against a benchmark in which log-likelihoods are 
Gaussian-distributed. For short BSs this second step is 
particularly delicate because their log-likelihoods are expressed 
as sums of contributions coming from single nucleotides 
treated independently. Therefore they cannot be 
approximated by Gaussian random variables [7], since the 
number of terms in the sum is too small. Indeed we show 
below that the probability distribution function (pdf) of 
the maximum of the log-likelihoods for short BSs is rarely 
the extreme value distribution for a Gaussian random vari- 
able. The prediction of short BSs thus requires a more 
precise method that is able to account correctly for large 
deviations in evaluating their statistical significance. 

In this work we apply the standard approach used in 
statistics to evaluate the distribution of the maximum, 
i.e. extreme-value theory and particularly the Peak-over-Threshold 
(POT) method, to estimate the statistical significance 
of putative BSs. This technique has been employed 
in financial analysis [8] and meteorology (see e.g. 
[9]). We first test our method by identifying, among 138 
transcription factors listed in JASPAR, the ones binding 
a specific set of skeletal-muscle specific genes, reproducing 
experimental results with a marked reduction of false 
positives in comparison with other computational methods 
[10,11]. Subsequently, we apply it to the detection of a 
BS that is widely studied experimentally, namely ERE [12] 
(estrogen responsive element, $\ell = 13$), and of two other 
BSs that are believed to be functionally related to ERE 
(called AP2 and C/EBP, both with $\ell = 12$), on a data set of 
134 promoters for genes whose expression is altered upon 
treatment with an estrogen-sensitive growth factor [13]. 

Setup. — The basic setting of probabilistic schemes is 
in general terms as follows. Consider a string of length $L$ 
drawn from a finite alphabet $A$. We assume that it can be 
divided in two parts: a background consisting of $L - \ell$ letters 
and a motif of length $\ell$ (the BS). These are produced 
in general by different stochastic models $P_b$ and $P_m$ (the 
latter encoded by the PSFM, in the case discussed above). 
Neglecting all correlations, the probability of observing a 
certain sequence $S_k = \{a_1, \ldots, a_L\}$ of length $L$ including 
a motif that starts at location $k + 1$ is simply 

$$P(S_k) = \prod_{i=1}^{L} P_b(a_i, i) \; \prod_{i=k+1}^{k+\ell} \frac{P_m(a_i, i)}{P_b(a_i, i)} \qquad (1)$$

where $P_b(a_i, i)$ (resp. $P_m(a_i, i)$) represents the probability 
to observe letter $a_i \in A$ in position $i$ in the background 
(resp. the motif). It is clear that the motif can be identified 
as the substring that maximizes the second factor in 
the right-hand side of (1), or equivalently its logarithm, 
i.e. 

$$W_k = \sum_{i=k+1}^{k+\ell} \left[ \log P_m(a_i, i) - \log P_b(a_i, i) \right] \qquad (2)$$

since larger values of $W_k$ suggest that the string starting at 
$k + 1$ is more likely to be a motif than a common substring. 
$W_k$ is called the score of the substring. Moving $k$ along 
the string, one can then compute $L - \ell + 1$ scores, one 
for each substring of length $\ell$, and select the one with the 
highest score as the most likely motif. We shall henceforth 
denote by $k^* + 1$ the starting locus of the score-maximizing 
substring. Note that multiple maxima may occur. 
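
As an illustration of the scoring scheme of eq. (2), the following minimal Python sketch slides a PSFM along a sequence and returns all $L - \ell + 1$ scores. It rests on assumptions that are not part of the text: the background is taken to be position-independent, and a small `pseudocount` (a hypothetical regularizer) keeps the logarithm finite when a PSFM entry is zero. The function names are illustrative only.

```python
import math

ALPHABET = "ACGT"


def log_odds_scores(seq, psfm, background, pseudocount=1e-3):
    """Return the scores W_k of eq. (2) for k = 0, ..., L - ell.

    seq        : DNA string over ALPHABET.
    psfm       : list of ell dicts, nucleotide -> frequency (each sums to 1).
    background : dict nucleotide -> probability (position-independent;
                 a simplifying assumption of this sketch).
    pseudocount: hypothetical regularizer avoiding log(0).
    """
    ell = len(psfm)
    # Precompute log P_m - log P_b for every motif position and letter.
    table = []
    for column in psfm:
        norm = sum(column[a] + pseudocount for a in ALPHABET)
        table.append({
            a: math.log((column[a] + pseudocount) / norm) - math.log(background[a])
            for a in ALPHABET
        })
    # W_k sums the log-odds of the ell letters starting at 0-based index k,
    # i.e. at location k + 1 in the 1-based convention of the text.
    return [
        sum(table[i][seq[k + i]] for i in range(ell))
        for k in range(len(seq) - ell + 1)
    ]


def best_start(scores):
    """Return k*, the (first) index of the maximal score."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy PSFM strongly preferring the word TGAC.
    psfm = [{b: (0.97 if b == c else 0.01) for b in ALPHABET} for c in "TGAC"]
    uniform = {b: 0.25 for b in ALPHABET}
    scores = log_odds_scores("ACGTTGACCA", psfm, uniform)
    print(best_start(scores), [round(s, 2) for s in scores])
```

Precomputing the log-odds table keeps the scan linear in $L\ell$; when several windows share the maximal score, `best_start` reports only the first one.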

Statistical significance. — The problem at this point 
is to establish how significant $W_{k^*}$ is in statistical terms, 
i.e. how unlikely it is that a particular score has arisen 
by chance. To this aim, one normally assumes that scores 
have a Gaussian distribution and are uncorrelated along 
the sequence (i.e. the $W_k$'s are independent random variables 
for different $k$), and evaluates the likelihood of a 
given maximum score by employing a Gumbel distribution$^1$. 
Unfortunately, in many cases motifs are short so 
the number of terms to be summed up in (2) is too small 
for generating a Gaussian random variable. The distribution 
of maxima may thus deviate significantly from a 
Gumbel distribution. To appreciate how the histogram of 
scores varies with $\ell$ one can study the pdf that emerges by 
applying artificial PSFMs on random promoters. We have 
constructed an ensemble of promoters of length 10000 using 
the nucleotide frequencies in the human genome as the 
underlying model. On each of these we tested a different 
artificial PSFM of size $\ell \times 4$ for $\ell \in \{5, \ldots, 32\}$. We have 
considered two cases: information-rich PSFMs, with non-zero 
entries only for two (randomly selected) nucleotides 
for each position; information-poor PSFMs, which have instead 
entries drawn from a uniform distribution on $[0, 1/2]$ 
(the normalization conditions being obviously enforced). 
These choices represent limiting cases, since real data are 
typically in-between these alternatives. For each realization 
we have carried out a Lilliefors test to probe the normality 
of the score distribution (other normality tests such 
as the Jarque-Bera test return a very similar picture). Results 
for the fraction $\phi$ of samples that do not pass the test 
are shown in Fig. 1. It is clear that the Gaussian hypothesis 
is inadequate for short motifs in both cases. Remarkably, 
for information-poor PSFMs it is troublesome also 
for longer motifs. Note that typical transcription factor 
BSs have $6 < \ell < 20$. Clearly, it would be important to 
outperform existing computational methods in the presence 
of information-poor PSFMs, i.e. when experimental 
data on motifs are less sharp. 

Fig. 1: Fraction $\phi$ of random (3rd-order Markovian) realizations 
that do not pass the Lilliefors Gaussianity test for the score 
distribution versus motif size $\ell$ (the test is passed by a Gaussian 
sample). Averages are over 300 (information-rich PSFM) and 
500 (information-poor PSFM) samples, respectively. 

$^1$ Recall that the limit cumulative distribution function of the 
(suitably normalized) maximum $M_n$ of a sequence of $n$ independent, 
identically-distributed random variables is given by the generalized 
extreme-value law 

$$\lim_{n \to \infty} \mathrm{Prob}\{M_n < x\} := H_\xi(x) = e^{-[1 + \xi x]^{-1/\xi}}, \qquad 1 + \xi x > 0 \qquad (3)$$

where the shape parameter $\xi \in \mathbb{R}$ allows one to distinguish three types 
of limiting behaviors, depending on whether $\xi > 0$ (Fréchet), $\xi < 0$ 
(Weibull) or $\xi \to 0$ (Gumbel). 
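
To make the role of the motif length concrete, the sketch below computes the exact skewness and excess kurtosis of a score built as a sum of independent per-position terms, using the fact that cumulants of independent variables add. It is not the Lilliefors experiment described above: it assumes a position-independent background, strictly positive PSFM entries, and independent positions, and the function name is illustrative. Under these assumptions the skewness decays only as $1/\sqrt{\ell}$ and the excess kurtosis as $1/\ell$ when all columns are alike, which is why short motifs stay visibly non-Gaussian.

```python
import math


def score_shape(psfm, background):
    """Skewness and excess kurtosis of the score at a random background site.

    The score is the sum over motif positions of log(f_a / b_a), where the
    letter a is drawn from `background` (independent positions assumed).
    All PSFM entries must be strictly positive in this sketch.
    """
    k2 = k3 = k4 = 0.0  # second, third and fourth cumulants of the sum
    for column in psfm:
        terms = [(background[a], math.log(column[a] / background[a]))
                 for a in background]
        mean = sum(p * s for p, s in terms)
        m2 = sum(p * (s - mean) ** 2 for p, s in terms)
        m3 = sum(p * (s - mean) ** 3 for p, s in terms)
        m4 = sum(p * (s - mean) ** 4 for p, s in terms)
        # Cumulants of independent terms are additive.
        k2 += m2
        k3 += m3
        k4 += m4 - 3.0 * m2 ** 2
    return k3 / k2 ** 1.5, k4 / k2 ** 2


if __name__ == "__main__":
    uniform = {b: 0.25 for b in "ACGT"}
    column = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    for ell in (5, 10, 20, 32):
        skew, kurt = score_shape([column] * ell, uniform)
        print(ell, round(skew, 3), round(kurt, 3))
```

A Gaussian has zero skewness and zero excess kurtosis, so the printed values give a rough measure of the distance from normality at each $\ell$.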

Accounting for large deviations. — The standard 
methodology to deal with tail events consists in selecting 
a high threshold and studying the exceedances of the 
threshold. The basis for this is a theorem by Pickands [14]. 
In simplified terms, it states that given a random variable 
$X$ and a threshold $u > 0$, the distribution function of 
$Y = X - u$ (the 'excess' over $u$) is such that 

$$\lim_{u \to x_F} \Pr\{0 < Y < y \mid X > u\} := G_{\xi,\sigma}(y) = 1 - \left(1 + \frac{\xi y}{\sigma}\right)^{-1/\xi} \qquad (4)$$

where $\sigma > 0$ is a scale parameter, $\xi$ is the shape parameter 
of the distribution of the maximum value of the random 
variable $X$ (see footnote 1), and $x_F$ is the right extremum 
of the distribution function $F(x) = \Pr\{X < x\}$, defined 
by $x_F = \inf\{x : F(x) = 1\}$. $G_{\xi,\sigma}$ is called the generalized 
Pareto distribution (GPD). In other words, the GPD is a 
good approximation for the distribution of excesses of a 
random variable over sufficiently high thresholds. Hence, 
given the set of scores $\{W_k\}$ and a threshold $u$, one can obtain 
estimates $\hat\xi$ for $\xi$ and $\hat\sigma$ for $\sigma$ by fitting the distribution 
of excesses over $u$ to a GPD. With $\hat\xi$ and $\hat\sigma$ it is possible 
to evaluate the probability to observe a score larger than 
$u$ using (3). Clearly, the smaller this quantity is, the more 
significant is the result from a statistical viewpoint. The 
parameter estimates will however depend on the chosen 
threshold, i.e. $\hat\xi = \hat\xi(u)$ and $\hat\sigma = \hat\sigma(u)$. The problem now 
consists in choosing $u$ optimally, so that the condition for 
the validity of Pickands' theorem is verified with good accuracy 
and one still has enough data above the threshold 
to be able to estimate the unknown parameters. As explained 
in detail in [8], to this aim one can resort to the following 
property: let $x_1, \ldots, x_n$ be $n$ independent realizations of 
a random variable with unknown distribution function $F$, and let 



$$e_n(u) = \frac{\sum_{i=1}^{n} (x_i - u)\, \theta(x_i - u)}{\sum_{i=1}^{n} \theta(x_i - u)} \qquad (5)$$

be the sample's mean excess over a fixed threshold $u$, with 
$\theta(x)$ Heaviside's step function. Then, if $F$ is a GPD, 



$$e(u) := \mathrm{E}[X - u \mid X > u] = \frac{\sigma + \xi u}{1 - \xi}, \qquad \xi < 1 \qquad (6)$$

This implies that when the empirical plot $e_n(u)$ versus $u$ 
follows approximately a straight line with a certain derivative 
above a value $\bar u$ of $u$, then the excesses over $\bar u$ follow 
approximately a GPD with shape parameter related to the 
observed derivative. This allows for an optimal selection 
of $u$ and, in turn, of $\hat\xi$ and $\hat\sigma$. 
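
The threshold selection and the fit can be sketched in a few lines of Python. The sample mean excess follows the definition given above (with the convention that only values strictly above $u$ count). For the fit, the text does not say which estimator was used; the sketch swaps in the simple method-of-moments estimator, which follows from the GPD mean $\sigma/(1-\xi)$ and variance $\sigma^2/[(1-\xi)^2(1-2\xi)]$ and is only meaningful for $\xi < 1/2$. Function names and the synthetic test data are illustrative assumptions.

```python
import random
import statistics


def mean_excess(xs, u):
    """Sample mean excess e_n(u): average of x - u over the x strictly above u."""
    excesses = [x - u for x in xs if x > u]
    return sum(excesses) / len(excesses) if excesses else float("nan")


def fit_gpd_moments(xs, u):
    """Method-of-moments GPD fit of the excesses over u (assumes xi < 1/2).

    This is a stand-in estimator chosen for brevity, not necessarily the
    fitting procedure used in the text.
    """
    excesses = [x - u for x in xs if x > u]
    m = statistics.fmean(excesses)
    s2 = statistics.variance(excesses)
    ratio = m * m / s2              # equals 1 - 2*xi for an exact GPD
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (1.0 + ratio)
    return xi, sigma


def sample_gpd(n, xi, sigma, rng):
    """Draw n GPD variates by inverse-transform sampling (xi != 0)."""
    return [sigma / xi * ((1.0 - rng.random()) ** (-xi) - 1.0) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_gpd(50000, -0.2, 1.0, rng)
    # For GPD data the mean excess is linear in u, with slope xi / (1 - xi).
    for u in (0.0, 0.5, 1.0, 1.5):
        print(u, round(mean_excess(data, u), 3), fit_gpd_moments(data, u))
```

In practice one would scan $u$ over a grid, look for the range where `mean_excess` is approximately linear, and fit only the excesses above the start of that range.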

As said above, once we have estimated these parameters 
we should evaluate the statistical significance of the 
scores via (4). This can be accomplished via a Peak-over-Threshold 
(POT) analysis [8, 15]. Consider the excesses 
of the scores over $u$, $Y_k = W_k - u > 0$ (scores that do not 
exceed $u$ are hereafter neglected). Given that the number 
$N$ of excesses above a threshold is a Poissonian variable 
(see e.g. [16]), one easily understands that 

$$P_k = \Pr\left\{ \max_{j \in \{0,1,\ldots,L-\ell\}} Y_j > Y_k \right\} = 1 - \Pr\left\{ \max_{j \in \{0,1,\ldots,L-\ell\}} Y_j \le Y_k \right\} = 1 - \exp\left[ -\lambda \left( 1 + \hat\xi Y_k / \hat\sigma \right)^{-1/\hat\xi} \right] \qquad (7)$$

The additional parameter $\lambda$ coming from the Poisson 
distribution can be estimated from the data simply as 
$\lambda = N/(L - \ell + 1)$, where $N$ is the actual number of 
scores falling above $u$ in our sample. Clearly, $P_k$ should 
be as small as possible for $Y_k$ to be close to the maximum. 
A precise condition for the prediction of real BSs is discussed 
below. 
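
The POT probability of a given excess can be evaluated directly once $\lambda$, $\xi$ and $\sigma$ are known. The sketch below is a minimal Python transcription of that formula, with $\lambda$ estimated as the fraction of scores above the threshold, as in the text. The handling of the $\xi \to 0$ limit (exponential tail) and of excesses beyond the upper endpoint when $\xi < 0$ (tail set to zero) are choices made here for numerical robustness; the function names are illustrative.

```python
import math


def estimate_lambda(scores, u):
    """Estimate lambda as N / (L - ell + 1): the fraction of scores above u."""
    n_exceed = sum(1 for w in scores if w > u)
    return n_exceed / len(scores)


def exceedance_p_value(y, lam, xi, sigma):
    """Probability that the maximal excess is larger than y (POT formula).

    y     : excess Y_k = W_k - u, must be non-negative.
    lam   : Poisson parameter.
    xi    : GPD shape parameter.
    sigma : GPD scale parameter, must be positive.
    """
    if y < 0.0 or sigma <= 0.0:
        raise ValueError("need y >= 0 and sigma > 0")
    if abs(xi) < 1e-12:
        tail = math.exp(-y / sigma)          # xi -> 0 limit of the GPD tail
    else:
        base = 1.0 + xi * y / sigma
        # For xi < 0 the GPD has a finite upper endpoint: no mass beyond it.
        tail = base ** (-1.0 / xi) if base > 0.0 else 0.0
    return -math.expm1(-lam * tail)          # 1 - exp(-lam * tail), accurately


if __name__ == "__main__":
    for y in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(y, exceedance_p_value(y, lam=0.05, xi=-0.2, sigma=1.0))
```

Using `math.expm1` avoids the loss of precision that `1 - math.exp(...)` would suffer for the very small probabilities that matter most here.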

Application to the detection of ERE. We have 
analyzed a set of 134 promoters of genes whose expression 
is upmodulated by estrogen, a hormone produced in the 
ovaries. Estrogen diffuses across the cell membrane into 
the cell, where it interacts with proteins called estrogen 
receptors. Once activated by estrogen, the receptors act 
primarily as transcription factors to regulate the expression 
of certain genes by binding to DNA. Estrogen receptors are 
widely studied in the biomedical literature since estrogen 
is related to the development and growth of most types 
of breast cancers. Indeed, breast cancer monitoring commonly 
includes tests for expression of the estrogen receptor, 
and reducing the supply of estrogen is part of breast 
cancer therapy. The interaction of an estrogen receptor 
with DNA occurs at a BS called estrogen responsive element 
(ERE, $\ell = 13$). The position of ERE is known experimentally 
on some promoters but it would be important to 
extend this knowledge to other genes that are sensitive to 
estrogen. Furthermore, binding at ERE is believed to be 
cooperatively linked to binding at two other motifs, called 
AP2 ($\ell = 12$) and C/EBP ($\ell = 12$). Whether such motifs 
are present on all promoters for estrogen-upmodulated 
genes is however not known. 

We have screened our data set for the (known) presence 
of ERE and for the (to be ascertained) presence of 
AP2 and C/EBP. The PSFMs for the latter two motifs have 
been extracted from the TransFac database$^2$. For ERE we 
have used the PSFM derived in [17]. For the sake of clarity, 
we have subdivided the 134 genes in two groups: the 
first contains the 14 genes for which experimental knowledge 
is available (TFF1, STS, CRKL, NROB2, CYP1B1, 
FEM1A, CYP4F11, FOXA1, RPS6KL, NRIP1, CTSD, 
GAPD, GREB1, IGFBP4) [13,18]; the second contains the 
remaining 120 genes, for which information is available only 
from computational studies through the NCBI database [19]. 

$^2$ Accession numbers: M00189 and M00770. 

It is now important to discuss the conditions for rejection 
of a putative motif. Statistically significant substrings 
of DNA (indexed by $k$) should satisfy two conditions. 
On the one hand, the value $P_k^{(t)}$ of $P_k$ calculated on the true 
promoter should be smaller than a confidence level $P_c$, 
since it would be desirable to minimize the probability of 
finding a score larger than $W_k$. $P_c$ must be chosen so as 
to guarantee that when the above procedure is applied to 
an 'engineered' promoter containing a certain number of 
motifs, all of these are detected correctly. In the cases we 
analyzed, $P_c$ turns out to vary in a range between 0.001 
and 0.02. 

On the other hand, $P_k^{(t)}$ should be larger than the value $P_k^{(r)}$ one 
would obtain when looking for a real motif on a random 
promoter, e.g. one drawn uniformly from $\{A, C, G, T\}$, with 
the same threshold $Y_k$ used for the real promoter. This 
condition enforces the expectation that the score of a certain 
motif computed on a real DNA sequence should be 
higher than that computed on a random string of DNA, 
where the motif can only occur by chance. Statistical accuracy 
can be increased by considering an ensemble of random 
promoters rather than just one, and computing $P_k^{(r)}$ 
as the average of $P_k$ over the ensemble. Indeed, some random 
strings will produce larger scores than other strings, 
so it is important to compare the true promoter directly 
with the random ones, especially so if the true promoter 
contains the motif. 

The condition that relevant substrings of length $\ell$ starting 
at locus $k + 1$ should satisfy is then 

$$P_k^{(r)} < P_k^{(t)} < P_c \qquad (8)$$

For comparison, we have considered another condition, 
less stringent than (8), namely that both 

$$P_k^{(t)} < P_c \quad \text{and} \quad P_k^{(r)} < P_c \qquad (9)$$

We shall denote the latter as the weak condition and the 
former as the strong one. These conditions differ from the 
one which is commonly used. Indeed, normally one only 
looks for motifs that are unlikely to appear in a random 
promoter, i.e. the only significance criterion is $P_k^{(r)} < P_c$. 
We show below that our setting ultimately allows for a 
reduction in the number of false positives with respect 
to other methods, while keeping the same predictive efficiency 
(e.g. the number of true positives) in test cases. 
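
The two acceptance rules can be written as a pair of small predicates. The sketch below is a direct Python transcription of the strong and weak conditions, together with the ensemble average used for the random-promoter probability; the function and argument names are illustrative and the numbers in the usage lines are made up for demonstration.

```python
from statistics import fmean


def mean_random_p(p_values):
    """Average P_k over an ensemble of random promoters."""
    return fmean(p_values)


def passes_strong(p_true, p_random, p_c):
    """Strong condition: P_k^(r) < P_k^(t) < P_c."""
    return p_random < p_true < p_c


def passes_weak(p_true, p_random, p_c):
    """Weak condition: both P_k^(t) < P_c and P_k^(r) < P_c."""
    return p_true < p_c and p_random < p_c


if __name__ == "__main__":
    p_c = 0.01                                         # illustrative confidence level
    p_random = mean_random_p([0.002, 0.001, 0.003])    # made-up ensemble values
    print(passes_strong(0.005, p_random, p_c), passes_weak(0.005, p_random, p_c))
    print(passes_strong(0.0005, p_random, p_c), passes_weak(0.0005, p_random, p_c))
```

Any candidate that passes the strong test also passes the weak one, so the weak condition can only enlarge the set of reported motifs.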

Ultimately, the algorithm we have used to search for 
ERE, AP2 and C/EBP on each of the 134 promoters can 
be summarized as follows. 

1. Define $P_m$ and $P_b$, see (1). The former is given by the 
experimental PSFMs of ERE, AP2 and C/EBP. For 
$P_b$ we have used a 3-step Markovian model (different 
choices do not impact results significantly). 

2. Compute the scores $\{W_k^{(t)}\}_{k=0}^{L-\ell}$ for the true promoter 
and $\{W_k^{(r)}\}_{k=0}^{L-\ell}$ for an ensemble of random promoters 
generated via a prescribed Markov model (e.g. randomly 
and uniformly from {A,C,G,T}). 

3. Estimate the optimal parameters $(u_t, \xi_t, \sigma_t)$ for the 
true promoter and $(u_r, \xi_r, \sigma_r)$ for the random promoters. 

4. Calculate the probability $P_k^{(t)}$ for the true promoter, 
see (7), and $P_k^{(r)}$ as the average of $P_k$ over the random 
promoters. 

5. Select motifs with index $k$ satisfying the significance 
conditions, either (8) or (9). 

Fig. 2: Sequence logo of ERE emerging from the 134 promoters 
studied. Top: strong significance condition. Bottom: weak 
significance condition. 
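
The random promoters required by step 2 can be produced with a few lines of Python. The sketch below estimates a Markov chain of a given order from a training sequence and samples new sequences from it. The text does not specify how the background model was estimated, so the training procedure, the add-one `pseudocount`, the uniform fallback for unseen contexts, and all function names are assumptions made here for illustration.

```python
import random
from collections import defaultdict


def fit_markov(seq, order=3, alphabet="ACGT", pseudocount=1.0):
    """Estimate transition probabilities P(next letter | previous `order` letters)."""
    counts = defaultdict(lambda: {a: pseudocount for a in alphabet})
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    for context, row in counts.items():
        total = sum(row.values())
        model[context] = {a: c / total for a, c in row.items()}
    return model


def sample_promoter(model, length, order=3, alphabet="ACGT", rng=None):
    """Sample a random sequence of the given length from a fitted model."""
    rng = rng or random.Random()
    uniform = {a: 1.0 / len(alphabet) for a in alphabet}
    # Seed the chain with `order` uniformly drawn letters.
    seq = [rng.choice(alphabet) for _ in range(order)]
    while len(seq) < length:
        context = "".join(seq[-order:]) if order else ""
        probs = model.get(context, uniform)  # unseen context: uniform fallback
        weights = [probs[a] for a in alphabet]
        seq.append(rng.choices(alphabet, weights=weights)[0])
    return "".join(seq[:length])


if __name__ == "__main__":
    training = "ACGTTGACCAGGTCATTTGACCACGT" * 20   # toy training sequence
    model = fit_markov(training, order=3)
    ensemble = [sample_promoter(model, 2000, order=3, rng=random.Random(s))
                for s in range(5)]
    print([len(p) for p in ensemble])
```

Setting `order=0` reduces the model to independent letters with the empirical frequencies of the training sequence.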

It is worth noting that more than one substring may pass 
the significance tests. In this respect, our choice of computing 
$P_k^{(t)}$, that is of considering the likelihood of observing 
a particular substring on the real promoter alongside 
$P_k^{(r)}$, allows us to draw sharper conclusions on suboptimal 
putative motifs since the distribution of the largest scores 
on real and random DNA should differ if a motif is actually 
present on the real sequence. The fact that more than 
one motif may occur obviously does not imply cooperation 
at the biological level. The method can however be modified 
to account for this aspect, e.g. to identify pairs of 
correlated motifs [20]. 

Results. — We have detected the presence of ERE on 
all of the 134 promoters, in agreement with experimental 
knowledge. In Fig. 2 we display the sequence logo$^3$ [21] 
relative to the whole data set of 134 promoters. This should 
be compared with the sequence GGTCA***TGACC (* = any 
nucleotide) constructed by inserting the most frequent nucleotide 
in each position and a * in positions where, experimentally, 
every nucleotide can be present. Notice that 
the sequence is palindromic, in the sense that the first five 
bases pair with the last five in reverse order (with the rules 
A-T, C-G). It is clear from Fig. 2 that our method recovers 
this property. 
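
The palindromic property can be checked mechanically: a site is palindromic in this sense when it coincides with its own reverse complement. The short Python sketch below does this for the ERE consensus, treating `*` as a wildcard that matches any nucleotide; the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "*": "*"}


def reverse_complement(site):
    """Reverse the site and replace each base by its pairing partner."""
    return "".join(COMPLEMENT[base] for base in reversed(site))


def is_palindromic(site):
    """True if the site equals its reverse complement, '*' matching anything."""
    partner = reverse_complement(site)
    return all(a == b or "*" in (a, b) for a, b in zip(site, partner))


if __name__ == "__main__":
    print(reverse_complement("GGTCA"))       # TGACC
    print(is_palindromic("GGTCA***TGACC"))   # True
```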

$^3$ The frequencies of bases at each position correspond to the relative 
heights of letters. The degree of sequence conservation is instead 
represented by the total height of a stack of letters, in units of bits 
of information. 
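
For concreteness, the stack heights of a sequence logo can be computed from a single PSFM column as follows. The sketch uses the usual information content of a four-letter column, $2 + \sum_j f_j \log_2 f_j$ bits, and omits the small-sample correction that logo software often applies; that omission and the function names are assumptions of this illustration.

```python
import math


def information_content(column):
    """Information content of one PSFM column, in bits (no small-sample correction)."""
    entropy = -sum(f * math.log2(f) for f in column.values() if f > 0.0)
    return math.log2(len(column)) - entropy


def letter_heights(column):
    """Height of each letter: its frequency times the total stack height."""
    total = information_content(column)
    return {base: f * total for base, f in column.items()}


if __name__ == "__main__":
    conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
    mixed = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
    flat = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    for column in (conserved, mixed, flat):
        print(information_content(column), letter_heights(column))
```

A perfectly conserved position reaches the maximum of 2 bits, while a position where all four nucleotides are equally frequent has zero height.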
Fig. 3: Sequence logo of AP2 emerging from the 134 promoters 
studied. Top: strong significance condition. Bottom: weak 
significance condition. 

Fig. 4: Sequence logo of C/EBP emerging from the 134 promoters 
studied. Top: strong significance condition. Bottom: 
weak significance condition. 



In contrast, AP2 and C/EBP were not found on all of 
the 134 promoters (see below for details 
from a reduced data set). The resulting sequence logos are 
shown in Figures 3 and 4. The former should be compared 
with the sequence CGCCCGCCGGCG built with the experimentally 
most frequent nucleotides at every position. 
Note however that the PSFM for AP2 (from TransFac) 
is based on a small number of known BSs (13 at the time of 
writing this article). For C/EBP, the sequence logo is to 
be compared to the experimental highest-frequency string 
[G/A]AATTTGGCAAA, where the first position is occupied 
by guanine or adenine with the same frequency. (In this 
case a much larger data sample is available to build the 
PSFM.) 

In summary, for the genes we considered the method 
returns sequences that are in very good agreement with 
the available experimental knowledge on BSs. 

Let us now consider the restricted data set formed by 
the 14 genes that have been directly assessed in experiments, 
at least for ERE. In Table 1 we show a summary 
of the results for the three motifs we considered. With the 
strong significance condition, our prediction is that, unlike 
ERE, AP2 and C/EBP are not present on all of the 14 genes. 
An experimental validation is not yet available. 

Let us now focus on one gene from the data set, namely 
GAPD (similar results are obtained for the other genes), 
and consider ERE. In Fig. 5 we display the probability-probability 
(PP) and quantile-quantile$^4$ (QQ) plots for 
GAPD. The former shows the empirical probability distribution 
of excesses versus a GPD; the latter focuses on 
the tails, showing the empirical quantiles of the distribution 
of excesses extracted from the data on GAPD versus 
the quantiles estimated from a GPD. These types of plots 
provide simple measures of plausibility of a certain model. 
One sees a convincing agreement between the data and an 
extreme-value distribution. 
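
The points of a PP plot and of a QQ plot against a fitted GPD can be generated as in the following Python sketch. The plotting position $i/(n+1)$ for the $i$-th smallest excess is a common convention assumed here, not one stated in the text, and the function names are illustrative. For a good fit both sets of points lie close to the diagonal.

```python
import math


def gpd_cdf(y, xi, sigma):
    """GPD distribution function G(y) for y >= 0."""
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    # For xi < 0, values beyond the upper endpoint have probability one.
    return 1.0 - base ** (-1.0 / xi) if base > 0.0 else 1.0


def gpd_quantile(p, xi, sigma):
    """Inverse of gpd_cdf for 0 < p < 1."""
    if abs(xi) < 1e-12:
        return -sigma * math.log(1.0 - p)
    return sigma / xi * ((1.0 - p) ** (-xi) - 1.0)


def pp_qq_points(excesses, xi, sigma):
    """Return (pp, qq): lists of (empirical, model) and (model, empirical) pairs."""
    ordered = sorted(excesses)
    n = len(ordered)
    pp, qq = [], []
    for rank, y in enumerate(ordered, start=1):
        p_emp = rank / (n + 1)               # assumed plotting position
        pp.append((p_emp, gpd_cdf(y, xi, sigma)))
        qq.append((gpd_quantile(p_emp, xi, sigma), y))
    return pp, qq


if __name__ == "__main__":
    excesses = [0.05, 0.2, 0.35, 0.6, 0.9, 1.4, 2.1]   # made-up excesses
    pp, qq = pp_qq_points(excesses, xi=-0.1, sigma=1.0)
    for point in qq:
        print(point)
```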



$^4$ For a random variable with probability distribution $F(x)$ and 
for any $0 < p < 1$, one defines the quantile corresponding to $p$ as 
$x(p) = \inf\{x : F(x) \ge p\}$. 



Application to skeletal-muscle specific genes. 
To have an idea of the performance of the method concerning 
false positives, we have tested it against a known 
biological benchmark. Specifically, we have considered the 
full set of nine skeletal-muscle specific genes studied in [11]. 
This set is well studied experimentally. In particular, it is 
known that six of the transcription factors from the JASPAR 
database attach to them [10, 11]. The corresponding 
motifs have lengths varying from 6 to 12 nucleotides. 
The best available computational technique, the Tomovic-Oakeley 
(TO) method [11], takes dependencies between 
sites into account and is able to identify correctly five of 
the six factors. Table 2 compares the performance of our 
algorithm with that of TO and with the best available 
(to our knowledge) algorithm based on cross-species comparison, 
ConSite [22]. We have chosen our parameters to 
obtain at least as many true positives as Tomovic-Oakeley. 
For this setting, the number of false positives is considerably 
lower in our case. 
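
The comparison can be made quantitative by aggregating the entries of Table 2. The Python sketch below copies the (motifs found, false positives) pairs from that table and computes, for each method, the total number of motifs found and the mean number of false positives per gene; the variable and function names are illustrative.

```python
# (motifs found out of 6, false positives) per gene, copied from Table 2.
TABLE_2 = {
    "ALDOA": {"ConSite": (5, 81), "TO": (5, 78), "BT": (5, 70)},
    "DES":   {"ConSite": (5, 80), "TO": (5, 74), "BT": (5, 70)},
    "MYOG":  {"ConSite": (5, 87), "TO": (5, 85), "BT": (6, 76)},
    "MYL1":  {"ConSite": (6, 86), "TO": (5, 75), "BT": (5, 71)},
    "TNNI1": {"ConSite": (5, 81), "TO": (5, 78), "BT": (5, 69)},
    "MYH7":  {"ConSite": (5, 77), "TO": (4, 75), "BT": (5, 76)},
    "MYH6":  {"ConSite": (5, 83), "TO": (5, 78), "BT": (5, 66)},
    "ACTA1": {"ConSite": (6, 80), "TO": (5, 77), "BT": (5, 67)},
    "ACTC1": {"ConSite": (5, 84), "TO": (5, 77), "BT": (5, 73)},
}


def summarize(table):
    """Return {method: (total motifs found, total false positives)}."""
    totals = {}
    for row in table.values():
        for method, (found, false_pos) in row.items():
            t_found, t_false = totals.get(method, (0, 0))
            totals[method] = (t_found + found, t_false + false_pos)
    return totals


if __name__ == "__main__":
    n_genes = len(TABLE_2)
    for method, (found, false_pos) in summarize(TABLE_2).items():
        print(method, found, false_pos, round(false_pos / n_genes, 1))
```

Summed over the nine genes, the false positives are 739 for ConSite, 697 for TO and 638 for BT, i.e. about 82.1, 77.4 and 70.9 per gene, while the motifs found total 47, 44 and 46 respectively.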

Conclusions. — Summarizing, we have accounted for 
large deviations in the distribution of scores for short BSs 
by a technique that combines well-known properties of 
extreme-value distributions and a POT analysis. The importance 
of fluctuations becomes clear if one studies the 
score distribution in a random setting. This approach 
allows for a self-consistent estimation of the statistical 
significance of putative motifs. The general problem of 
recognizing a small pattern in a large background however 
presents many open issues. Among these we mention 
those that have perhaps a more direct biological implication. 
First, for obvious reasons it would be important 
to devise methods yielding a still smaller number 
of false positives. To this aim a deeper analysis of the 
performance on artificial data sets would be required, so 
as to improve the estimation of our parameters and to 
compare the performances of different methods as a function 
of $\ell$ and of the structure of the PSFM. Second, 
one should address the issue of cooperation between transcription 
factors. In principle, this requires overcoming 
the independent-nucleotides approximation and developing 
techniques that account for score correlations along the 
sequence. Methods accounting for correlations already exist 
but they need, at present, more parameters and larger 
data sets to obtain a reasonable statistical significance. In 
our case, it is possible to take into account the effect of 
dependencies by properly grouping scores and applying 
extreme-value theory to the block scores. This extension 
is the object of further work [20]. Clearly, more effective 
methods would be very welcome and refined statistical and 
probabilistic tools are likely to play a major role in their 
development. 

Gene      ERE   AP2   C/EBP
TFF1      Yes   No    Yes
STS       Yes   No    No
CRKL      Yes   Yes   No
NROB2     Yes   No    Yes
CYP1B1    Yes   No    No
FEM1A     Yes   Yes   Yes
CYP4F11   Yes   Yes   No
FOXA1     Yes   No    Yes
RPS6KL    Yes   No    No
NRIP1     Yes   No    Yes
CTSD      Yes   Yes   Yes
GAPD      Yes   No    No
GREB1     Yes   No    No
IGFBP4    Yes   No    Yes

Table 1: Presence (Yes) or absence (No) 
of the ERE, AP2 or C/EBP motif on the genes reported in the 
first column. Results are shown for the strong significance 
condition. 

[Figure: PP plot; horizontal axis: Empirical]

Fig. 5: PP-plot (top) and QQ-plot (bottom) for GAPD. 

Gene    ConSite   TO     BT
ALDOA   5/81      5/78   5/70
DES     5/80      5/74   5/70
MYOG    5/87      5/85   6/76
MYL1    6/86      5/75   5/71
TNNI1   5/81      5/78   5/69
MYH7    5/77      4/75   5/76
MYH6    5/83      5/78   5/66
ACTA1   6/80      5/77   5/67
ACTC1   5/84      5/77   5/73

Table 2: Comparison between the performance of the ConSite, 
the Tomovic-Oakeley (TO, including site dependencies) and 
the present (BT) algorithm on the set of 9 genes studied in [11]. 
The first number in each entry gives the number of motifs found 
on the genes (out of 6), the second gives the number of false 
positives retrieved in the JASPAR database. 

We are deeply indebted to F. Cordero for providing us 
with the modified PSFM for ERE, and to R. Calogero 
for many important discussions and for a useful collaboration. 